Introduction


Prosper, a peer-to-peer lending company, kept a dataset that contained 113,937 loans with 81 variables per loan. The database can be downloaded from this url: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv

I chose this dataset to help increase my own personal financial literacy and health. Mortgages can make or break someone’s adult life; and either leave them happy and care free or crippled under debt. Through this exploration I hope to get a better understanding of what variables seem to influence the interest rate and the loan amount.

Since there’s no variable for loan amount approved, I’ll mainly focus on variables that influence borrower rates.

#Load Dataset
loanData <- read.csv('C:/Users/Admin/Downloads/prosperLoanData.csv')
names(loanData)
##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"
str(loanData)  
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
library(dplyr)
# Finds the number of NAs for each column. 
colSums(is.na(loanData))
##                          ListingKey                       ListingNumber 
##                                   0                                   0 
##                 ListingCreationDate                         CreditGrade 
##                                   0                                   0 
##                                Term                          LoanStatus 
##                                   0                                   0 
##                          ClosedDate                         BorrowerAPR 
##                                   0                                  25 
##                        BorrowerRate                         LenderYield 
##                                   0                                   0 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                               29084                               29084 
##                     EstimatedReturn             ProsperRating..numeric. 
##                               29084                               29084 
##               ProsperRating..Alpha.                        ProsperScore 
##                                   0                               29084 
##           ListingCategory..numeric.                       BorrowerState 
##                                   0                                   0 
##                          Occupation                    EmploymentStatus 
##                                   0                                   0 
##            EmploymentStatusDuration                 IsBorrowerHomeowner 
##                                7625                                   0 
##                    CurrentlyInGroup                            GroupKey 
##                                   0                                   0 
##                    DateCreditPulled               CreditScoreRangeLower 
##                                   0                                 591 
##               CreditScoreRangeUpper             FirstRecordedCreditLine 
##                                 591                                   0 
##                  CurrentCreditLines                     OpenCreditLines 
##                                7604                                7604 
##          TotalCreditLinespast7years               OpenRevolvingAccounts 
##                                 697                                   0 
##         OpenRevolvingMonthlyPayment                InquiriesLast6Months 
##                                   0                                 697 
##                      TotalInquiries                CurrentDelinquencies 
##                                1159                                 697 
##                    AmountDelinquent             DelinquenciesLast7Years 
##                                7622                                 990 
##            PublicRecordsLast10Years           PublicRecordsLast12Months 
##                                 697                                7604 
##              RevolvingCreditBalance                 BankcardUtilization 
##                                7604                                7604 
##             AvailableBankcardCredit                         TotalTrades 
##                                7544                                7544 
##  TradesNeverDelinquent..percentage.             TradesOpenedLast6Months 
##                                7544                                7544 
##                   DebtToIncomeRatio                         IncomeRange 
##                                8554                                   0 
##                    IncomeVerifiable                 StatedMonthlyIncome 
##                                   0                                   0 
##                             LoanKey                   TotalProsperLoans 
##                                   0                               91852 
##          TotalProsperPaymentsBilled               OnTimeProsperPayments 
##                               91852                               91852 
## ProsperPaymentsLessThanOneMonthLate     ProsperPaymentsOneMonthPlusLate 
##                               91852                               91852 
##            ProsperPrincipalBorrowed         ProsperPrincipalOutstanding 
##                               91852                               91852 
##         ScorexChangeAtTimeOfListing           LoanCurrentDaysDelinquent 
##                               95009                                   0 
##       LoanFirstDefaultedCycleNumber          LoanMonthsSinceOrigination 
##                               96985                                   0 
##                          LoanNumber                  LoanOriginalAmount 
##                                   0                                   0 
##                 LoanOriginationDate              LoanOriginationQuarter 
##                                   0                                   0 
##                           MemberKey                  MonthlyLoanPayment 
##                                   0                                   0 
##                 LP_CustomerPayments        LP_CustomerPrincipalPayments 
##                                   0                                   0 
##                  LP_InterestandFees                      LP_ServiceFees 
##                                   0                                   0 
##                   LP_CollectionFees               LP_GrossPrincipalLoss 
##                                   0                                   0 
##                 LP_NetPrincipalLoss     LP_NonPrincipalRecoverypayments 
##                                   0                                   0 
##                       PercentFunded                     Recommendations 
##                                   0                                   0 
##          InvestmentFromFriendsCount         InvestmentFromFriendsAmount 
##                                   0                                   0 
##                           Investors 
##                                   0

Through the use of the is.na function, we can determine the number of NAs in each variable. This helps me filter our variables with too many NA’s.

Univariate Plots


library(ggplot2)
ggplot(aes(x = loanData$StatedMonthlyIncome), data = loanData) + 
  xlab('Stated Monthly Income') + 
  ylab('Number of Borrowers') +
  geom_histogram() + 
  scale_x_continuous(limits = c(0, 12266))

summary(loanData$StatedMonthlyIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003
loanData <- subset(loanData, StatedMonthlyIncome <= 100000)

Through the summary function on the variable: Stated Monthly Income; we learned that the median income is $4667 with an IQR of $3625. That means that outliers would be found above $12,265.5 and below $2,237.5. I’ve confingured the plot to only show stated monthly income’s between $0 and $12,266. You can see that the data is skewed right with more borrowers between the 1st quartile $3,200 and the 3rd quartile $6816.

loanData$EmploymentStatus <- ordered(loanData$EmploymentStatus, 
                                     levels=c("Employed", 
                                              "Full-time", 
                                              "Part-time", 
                                              "Self-employed", 
                                              "Retired", 
                                              "Not employed", 
                                              "Not available", 
                                              "Other", 
                                              ""))

ggplot(data = loanData, aes(x=loanData$EmploymentStatus)) +
  geom_bar() + 
  xlab("Employment Status") +
  ylab("Number of Borrowers")

  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## List of 1
##  $ axis.text.x:List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 1
##   ..$ vjust        : NULL
##   ..$ angle        : num 90
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE
summary(loanData$EmploymentStatus)
##      Employed     Full-time     Part-time Self-employed       Retired 
##         67316         26353          1088          6128           795 
##  Not employed Not available         Other               
##           835          5344          3806          2255

It’s hard to say what Employed means in this case. Looking at Full-Time, Part-time, and Self-Employed, which are typical employment situations; the numbers don’t seem to add up to the number of borrowers employed.

ggplot(aes(x=EmploymentStatusDuration), data=loanData) +
  geom_histogram(bins=100) +
  xlab("Employment Status Duration in Months") +
  ylab('Number of Borrowers') +
  ggtitle("Employment Status Duration Histogram") +
  scale_x_continuous(limits = c(0, 303.5))

summary(loanData$EmploymentStatusDuration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7622

Here, we go a bit deeper into Employment looking specifically at status of how long someone has been employed in months. Running the summary function on employment status duration reveals that the median length of employment in 67 months. The max of 755 is skewing the data right and possibly conflicting the graph. Finding out outliers are above 303.5, I’ve limited the x-axis to better visualize the data. Is the employment status, Employed, a situational matter having to do with a certain length of months employed? This may need further scrutiny.

p1 <- ggplot(aes(loanData$LoanOriginalAmount), data = loanData) +
  geom_histogram() +
  xlab('Original Loan Amount') +
  ylab('Number of Borrowers')

#Created breaks to determine possible location of spikes
p2 <- p1 +
  scale_x_continuous(breaks = seq(0, 40000, 3500))


library(gridExtra)
grid.arrange(p1, p2, ncol = 1)

summary(loanData$LoanOriginalAmount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

It’s interesting to note that there are several spikes on certain locations of the data. I changed the X-axis to better determine why that might be. At first, I believed they may be at $5000 intervals, it seems the first 3 spikes are at $3500, $7000, and $10,500. I’m not sure why this may be, this needs further scrutiny. Also, it seems at least 75% of the loans are under $12,000.

loanData$CreditGrade <- ordered(loanData$CreditGrade, 
                                levels=c('AA', 
                                         'A', 
                                         'B', 
                                         'C', 
                                         'D', 
                                         'E', 
                                         'HR', 
                                         'NC', 
                                         ''))

qplot(loanData$CreditGrade) +
  xlab("Credit Grade")

summary(loanData$CreditGrade)
##    AA     A     B     C     D     E    HR    NC       
##  3507  3314  4389  5648  5153  3288  3508   141 84972

Several thousand loans, don’t have a credit grade assigned.

# Plot histogram for CreditScoreRangeLower with 100 bins
p1 <-
  ggplot(aes(x=CreditScoreRangeLower), data=loanData) +
    geom_histogram(bins=100) +
    xlab('Credit Score Lower') +
    ggtitle("Credit Score Range Lower Histogram")

# Plot histogram for CreditScoreRangeUpper with 100 bins
p2 <-
  ggplot(aes(x=CreditScoreRangeUpper), data=loanData) +
    geom_histogram(bins=100) +
    xlab('Credit Score Upper') +
    ggtitle("Credit Score Range Upper Histogram")

#Created a new variable
loanData$CreditScoreAverage <- (loanData$CreditScoreRangeLower +
                                  loanData$CreditScoreRangeUpper) / 2

p3 <-  
  ggplot(aes(x=loanData$CreditScoreAverage), data=loanData) +
    geom_histogram(bins=100) + 
    xlab('Credit Score Average') +
    ggtitle("Credit Score Average Histogram")

grid.arrange(p1, p2, p3, ncol = 1)

summary(loanData$CreditScoreAverage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

All three histograms look fairly similar. The first two hisograms represent the lower and upper range of credit scores, while the third represents the average between the lower and upper range. Most average credit scores seem to be in between 600 and 800.

ggplot(aes(x = loanData$BorrowerRate), data = loanData) +
  geom_histogram() +
  xlab('Borrowing Rate') +
  scale_x_continuous(limits = c(0.02, 0.38))

loanData %>%
  group_by(BorrowerRate) %>%
  summarise(Count=n()) %>%
  arrange(desc(Count))
## # A tibble: 2,294 x 2
##    BorrowerRate Count
##           <dbl> <int>
##  1        0.318  3671
##  2        0.350  1905
##  3        0.320  1648
##  4        0.290  1508
##  5        0.270  1319
##  6        0.150  1182
##  7        0.140  1035
##  8        0.110   949
##  9        0.200   907
## 10        0.158   806
## # ... with 2,284 more rows
summary(loanData$BorrowerRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

The graph is almost right skewed (if it didn’t have the peak around .36). The interest rates don’t show units though, and most likely the interest rates aren’t between 5% to 35%, so we’d have to assume they are between 0.5% to 3.5%.

loanData$BorrowerRate.bucket <- cut(loanData$BorrowerRate, breaks = 8)
table(loanData$BorrowerRate.bucket)
## 
## (-0.000498,0.0622]     (0.0622,0.124]      (0.124,0.187] 
##                861              22753              34820 
##      (0.187,0.249]      (0.249,0.311]      (0.311,0.373] 
##              24966              19781              10730 
##      (0.373,0.435]      (0.435,0.498] 
##                  3                  6

Here cutting the variable, turns the rates into categorical variables and then can determine the counts of the rates between each category. Although the highest count borrower rate is .3177, regardless of the differing level of breaks, there are other rates, which more counts seem to fall under.

loanData$LoanStatusGeneral <- loanData$LoanStatus

levels(loanData$LoanStatusGeneral) <- c(levels(loanData$LoanStatusGeneral), 
                                        'PastDue')

past_due_rows <- grep('^Past', loanData$LoanStatusGeneral)

loanData$LoanStatusGeneral[past_due_rows] <- 'PastDue'

loanData$LoanStatusGeneral <- ordered(loanData$LoanStatusGeneral, 
                                      levels=c("Completed", 
                                               "FinalPaymentInProgress", 
                                               "Current", 
                                               "PastDue", 
                                               "Defaulted", 
                                               "Chargedoff", 
                                               "Cancelled"))

ggplot(loanData,
       aes(x=LoanStatusGeneral)) + 
  geom_bar() +
  xlab('Loan Status General') + 
  ylab('Number of Borrowers') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

summary(loanData$LoanStatusGeneral)
##              Completed FinalPaymentInProgress                Current 
##                  38066                    205                  56569 
##                PastDue              Defaulted             Chargedoff 
##                   2067                   5018                  11990 
##              Cancelled 
##                      5

Most loans to seem to be current or completed. Charged off is also notable here. It’s curious that 5 loans have been canceled.

# Plot histogram for Monthly Loan Payment
ggplot(aes(x=MonthlyLoanPayment), data=loanData) +
  geom_histogram(bins=100) +
  xlab("Monthly Loan Payment") + 
  scale_x_continuous(limits = c(0, 750)) +
  ggtitle("Monthly Loan Payment Histogram")

summary(loanData$MonthlyLoanPayment)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2251.5

After accounting for outliers, most of the monthly loan payments seem to be lower than approximately $371.6.

Univariate Analysis


Dataset Structure

The original dataset contained 113,937 observations of 81 variables. Some of these variables include:

  • loan amount - most are under $12,000,
  • credit grade of the loan (mostly missing),
  • loan status (mainly completed, current, or defaulted),
  • employment status (high number of employed numbers, but unable to gather how the number was reached),
  • employment status duration (median 67 months),
  • borrower credit score (mainly between 600 and 800),
  • borrower interest rate (mainly .3177, but categorically a higher number are between 0.124 - 0.187), and
  • monthly loan payments are lower than $371.6

Main Features of Interest

I’m curious of the correlation between income related variables, loan status, and credit score.

Other Potentially Interesting Features

Everyone is different (interest rates, monthly income, monthly loan payments, employment status, occupation, and credit score). Depending on how these stack up, these variables can affect loan repayment and credit score. For example, people with lower monthly incomes are possibly correlated to low or no credit.

New Variables Created

I created two new variables:

BorrowerRate.bucket(using the cut function to turn borrower rate into a categorical variable) CreditScoreAverage (a function of CreditScoreRangeLower and CreditScoreRangeUpper).

Interesting Distributions

The StatedMonthlyIncome distribution was right-skewed with a median of $4667. There seemed to be multiple outliers, especially since the max monthly income was $1,750,003. Credit score average and borrower rate were normally distributed with a median of 689.5 and 0.184 respectively.

Bivariate Plots


The first thing I’ll do is get a quick high-level overview of which interesting numeric values are correlated.

library(magrittr)
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 3.4.4
cor_matric <- cor(loanData[ , c('StatedMonthlyIncome', 
                  'CreditScoreAverage', 
                  'ProsperScore',
                  'BorrowerRate',
                  'EstimatedReturn',
                  'OpenCreditLines',
                  'TotalCreditLinespast7years',
                  'InquiriesLast6Months',
                  'CurrentDelinquencies',
                  'LoanOriginalAmount',
                  'DebtToIncomeRatio')], 
    use="complete.obs")

ggcorrplot(cor_matric, 
           hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           title="Correlogram of loanData quant variables", 
           ggtheme=theme_bw)

Using the above functions, we are able to determine 4 quantitative variables that have strong correlations with each other (positively or negatively):

  • BorrowerRate and EstimatedReturn (0.83)
  • ProsperScore and BorrowerRate (-0.66)
  • OpenCreditLines and TotalCreditLinespast7years (0.57)
  • CreditScoreAverage and BorrowerRate (-0.53)
# StateMonthly Income greater than 10 and apply log scale on StatedMonthlyIncome
ggplot(aes(x=CreditScoreAverage,y=StatedMonthlyIncome), 
       data=subset(loanData,
                     CreditScoreAverage>300 &
                     StatedMonthlyIncome>10)) +
  xlab('Credit Score Average') +
  ylab('Stated Monthly Income') +
  geom_point(shape=21, alpha=1/15, position='jitter', 
             colour='red', fill='black', size=1.2, stroke=1) +
  scale_y_log10() +
  ggtitle("CreditScore vs. log(Monthly Income)")

# Create subset where StatedMonthlyIncome greater than 10
filter_monthlyIncome <- subset(loanData,StatedMonthlyIncome >10)

# Linear Model between CreditScoreAverage and StatedMonthlyIncome
f <- lm(log(filter_monthlyIncome$StatedMonthlyIncome) 
        ~ filter_monthlyIncome$CreditScoreAverage)
summary(f)
## 
## Call:
## lm(formula = log(filter_monthlyIncome$StatedMonthlyIncome) ~ 
##     filter_monthlyIncome$CreditScoreAverage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1383 -0.3443  0.0184  0.3700  3.5094 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                             6.883e+00  1.908e-02  360.77
## filter_monthlyIncome$CreditScoreAverage 2.255e-03  2.733e-05   82.53
##                                         Pr(>|t|)    
## (Intercept)                               <2e-16 ***
## filter_monthlyIncome$CreditScoreAverage   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6054 on 111633 degrees of freedom
##   (578 observations deleted due to missingness)
## Multiple R-squared:  0.05751,    Adjusted R-squared:  0.0575 
## F-statistic:  6811 on 1 and 111633 DF,  p-value: < 2.2e-16
cor.test(
  log(filter_monthlyIncome$StatedMonthlyIncome), 
  filter_monthlyIncome$CreditScoreAverage)
## 
##  Pearson's product-moment correlation
## 
## data:  log(filter_monthlyIncome$StatedMonthlyIncome) and filter_monthlyIncome$CreditScoreAverage
## t = 82.532, df = 111630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2342713 0.2453288
## sample estimates:
##       cor 
## 0.2398078

After transforming the data, the graph is more linearlized and shows a bit more correlation.

# jitter plot between StatedMonthlyIncome and LoanOriginalAmount
# apply log scale on x axis and filter StatedMonthlyIncome greater than 10
ggplot(aes(x=StatedMonthlyIncome,y=LoanOriginalAmount), 
       data=subset(loanData,StatedMonthlyIncome>10)) +
  scale_x_log10() +
  geom_point(shape=21, alpha=1/20, position='jitter', 
             colour='orange', fill='black', size=1.5, stroke=.5) +
  xlab('Stated Monthly Income') +
  ylab('Loan Original Amount') +
  ggtitle("Log of Stated Monthly Income vs. Loan Amount")

# Linear Model between StatedMonthlyIncome and LoanOriginalAmount
f <- lm(LoanOriginalAmount ~ StatedMonthlyIncome, 
        data=subset(loanData,StatedMonthlyIncome>10))
summary(f)
## 
## Call:
## lm(formula = LoanOriginalAmount ~ StatedMonthlyIncome, data = subset(loanData, 
##     StatedMonthlyIncome > 10))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57144  -4171  -1685   3179  25096 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.143e+03  2.953e+01   174.1   <2e-16 ***
## StatedMonthlyIncome 5.713e-01  4.242e-03   134.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5796 on 112211 degrees of freedom
## Multiple R-squared:  0.1392, Adjusted R-squared:  0.1392 
## F-statistic: 1.814e+04 on 1 and 112211 DF,  p-value: < 2.2e-16
# Correlation test between LoanOriginalAmount and log of StatedMonthlyIncome 
tmp <- subset(loanData,StatedMonthlyIncome>10)
cor.test(tmp$LoanOriginalAmount, 
         log(tmp$StatedMonthlyIncome))
## 
##  Pearson's product-moment correlation
## 
## data:  tmp$LoanOriginalAmount and log(tmp$StatedMonthlyIncome)
## t = 155.7, df = 112210, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4166707 0.4262938
## sample estimates:
##       cor 
## 0.4214941

The log of monthly income and loan amount show a linear relationship. Their seems to be horizontal lines at or near intervals of $5,000 up to approximately $25,000. This may show when loan amounts are rounded, as this makes sense as most people tend to think in terms of round numbers (this isn’t how much the borrower is approved for, just how much they asked for).

#Plot a scatter plot beween Stated Monthly Income (filtered for income less than $25,000) and Borrwer Rates
p1 <- ggplot(subset(loanData, StatedMonthlyIncome < 25000),
       aes(x= StatedMonthlyIncome,
           y=BorrowerRate)) + 
  xlab('Stated Monthly Income') +
  ylab('Borrower Rate') +
  geom_point(shape=21, alpha=1/25, position='jitter', 
             colour='blue', fill='black', size=1.2, stroke=1)

#Plot a scatter plot between Average Credit Score (greater than 300) and Borrower Rates
p2 <- ggplot(aes(x= CreditScoreAverage,
           y=BorrowerRate),
           data = subset(loanData, CreditScoreAverage>300)) + 
  xlab('Credit Score Average') +
  ylab('Borrower Rate') +
  geom_point(shape=21, alpha=1/45, position='jitter', 
             colour='red', fill='black', size=1.2, stroke=1) + 
  coord_cartesian(xlim = c(450, 900))

#Plot a scatter plot between Prosper Score and Borrower Rate
p3 <- ggplot(aes(x=ProsperScore, y = BorrowerRate),
             data = subset(loanData, 
                           !is.na(ProsperScore))) +
  geom_point(shape=21, alpha=1/35, position='jitter', 
             colour='brown', fill='black', size=1.2, stroke=1) +
  xlab('Prosper Score') +
  ylab('Borrower Rate')

#Plot a scatter plot between Borrower Rate and Estimated Return
p4 <- ggplot(aes(BorrowerRate, EstimatedReturn), 
             data = loanData) + 
  geom_point(shape=21, alpha=1/25, position='jitter', 
             colour='purple', fill='black', size=1.2, stroke=1)

grid.arrange(p1, p2, p3, p4, ncol = 2)

#Correlation test between Stated Monthly Income and Borrower Rate
tmp <- subset(loanData,StatedMonthlyIncome < 25000)
cor.test(tmp$BorrowerRate,
         tmp$StatedMonthlyIncome, 
         type='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  tmp$BorrowerRate and tmp$StatedMonthlyIncome
## t = -66.796, df = 113280, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2002596 -0.1890543
## sample estimates:
##        cor 
## -0.1946633
#Correlation test between Average Credit Score and Borrower Rate
tmp1 <- subset(loanData,CreditScoreAverage>300)
cor.test(tmp$BorrowerRate,
         tmp$CreditScoreAverage, 
         type='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  tmp$BorrowerRate and tmp$CreditScoreAverage
## t = -174.47, df = 112700, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4657252 -0.4565316
## sample estimates:
##        cor 
## -0.4611407
#Correlation test between Prosper Score and Borrower Rate
tmp2 <- subset(loanData,!is.na(ProsperScore))
cor.test(tmp$BorrowerRate,
         tmp$ProsperScore, 
         type='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  tmp$BorrowerRate and tmp$ProsperScore
## t = -248, df = 84336, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6532948 -0.6454893
## sample estimates:
##        cor 
## -0.6494092
#Correlation test between Borrower Rate and Estimated Return
tmp3 <- subset(loanData, EstimatedReturn > 0)
cor.test(tmp$BorrowerRate,
         tmp$EstimatedReturn, 
         type='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  tmp$BorrowerRate and tmp$EstimatedReturn
## t = 412.46, df = 84336, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8154121 0.8198859
## sample estimates:
##       cor 
## 0.8176613

Above you see four graphs. Plot 1: Looks at Stated Monthly Income compared to Borrower Rate. There seems to be some negative correlation of -0.19. Plot 2: Looks at Average Credit Score compared to Borrower Rate. There is a negative correlation of -0.46 (as credit score increases, there is a correlation that borrower rate decreases). Plot 3: Looks at Prosper Score (NA values are ignored) compared to Borrower Rate. There is a strong negative correlation of -0.65 (as Prosper score increases, borrower rates decrease). Plot 4: Looks at Estimated Return and how much does Borrower rate affect it. There is a strong positive correlation of .82. Estimated return could be a function for borrower rate. This needs more scrutiny.

ggplot(loanData,
       aes(x= LoanStatusGeneral,
           y=CreditScoreAverage)) + 
  geom_boxplot() +
  xlab('Loan Status General') +
  ylab('Credit Score Average') + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Here, median credit scores seem to illustrate that lower credit scores have a higher chance of defaulting or being charged off.

# jitter plot Prosper Score and Credit Score Average
# filter Credit Score Average Without na values and greater than 300
ggplot(aes(x=ProsperScore,y=CreditScoreAverage), 
       data=subset(loanData,!is.na(CreditScoreAverage) &
                     CreditScoreAverage > 300 &
                     !is.na(ProsperScore))) +
  geom_point(shape=21, alpha=1/25, position='jitter', 
             colour='blue', fill='black', size=1.2, stroke=1) +
  xlab('Prosper Score') +
  ylab('Credit Score Average') +
  ggtitle("Prosper Score vs. Credit Score Average")

It seems that after a prosper score of 10, credit score decreases. People with a good credit score, tend to have a prosper score of 10.

Bivariate Analysis


Relationships Found With Main Features of Interest

The log of monthly income and loan amount show a linear relationship. Their seems to be horizontal lines at or near intervals of $5,000 up to approximately $25,000. Their is a correlation found using the log of monthly income at 0.42. From this data, we can gather that for those who have a monthly income of less than $8000, their loan amounts tend not to be greater than $25,000.

Visually, there seemed to be little or no correlation between stated monthly income and borrower rate. By running a correlation test, there seems to be a slight correlation of -0.19.

Average Credit Score, Prosper Score, and Estimated Return had correlation scores of the following: -0.46, -0.65, and 0.82. Judging from the higher negative correlations we can associate that borrower rate decreases as average credit score and prosper score increase. On the other hand, estimated return has a strong positive correlation with borrower rate, meaning as borrower rate increases, estimated return increases.

Other Interesting Relationships

Lower credit scores seems to relate to having a higher chance of defaulting or charging off.

Up until approximately a prosper score of 10 their seems to be a possible relationship between prosper score and credit score.

Strongest Relationships Found

The strongest relationship I found was between BorrowerRate and EstimatedReturn. Also there was one between ProsperScore and BorrowerRate.

Multivariate Plots


library(ggalt)
## Warning: package 'ggalt' was built under R version 3.4.4
ggplot(loanData, aes(x=StatedMonthlyIncome, 
                     y=LoanOriginalAmount, 
                     color=LoanStatusGeneral)) + 
  geom_jitter(alpha = 1/10) + 
  xlab('Stated Monthly Income') +
  ylab('Loan Original Amount') + 
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Loan Status',
                                          override.aes = list(alpha = 1, 
                                                              size = 2)),
                     direction = -1) +
  coord_cartesian(xlim = c(0, 30000), 
                  ylim = c(0, max(loanData$LoanOriginalAmount))) + 
  theme_dark()

There seem to be many completed loans where the original loan amount is fairly low. As the original loan amount increases and stated monthly income is lower than $10,000 you can see that Defaulted and Charged off loans are occuring more frequently.

ggplot(subset(loanData, EmploymentStatus != ''), 
       aes(x=StatedMonthlyIncome,
           y=LoanOriginalAmount,
           color=EmploymentStatus)) + 
  xlab('Stated Monthly Income') +
  ylab('Loan Original Amount') +
  geom_jitter(alpha = 1/5) + 
  scale_color_brewer(type = 'div',
                     palette = 4,
                     guide = guide_legend(title = 'Employment Status',
                                          override.aes = list(alpha = 1, 
                                                              size = 2)),
                     direction = -1) +
  #geom_smooth() +
  coord_cartesian(xlim = c(0,30000), 
                  ylim = c(0,max(loanData$LoanOriginalAmount)))

Here, we color the data for Employment Status. For the most part, the data is what is expected: Retired, not employed, not available have less than $3000, with most near $0.

p1 <- ggplot(subset(loanData, EmploymentStatus %in% c('Employed',
                                                'Full-time', 
                                                'Part-time', 
                                                'Self-employed')), 
       aes(x=StatedMonthlyIncome,
           y=LoanOriginalAmount,
           color=EmploymentStatus)) + 
  xlab('Stated Monthly Income') +
  ylab('Loan Original Amount') +
  geom_jitter(alpha = 1/10) + 
  scale_color_brewer(type = 'qual',
                     #palette = 1,
                     guide = guide_legend(title = 'Employment Status',
                                          override.aes = list(alpha = 1, 
                                                              size = 2)),
                     direction = -1) +
  #geom_smooth() +
  coord_cartesian(xlim = c(0,30000), 
                  ylim = c(0,max(loanData$LoanOriginalAmount))) + 
  theme_dark()

p2 <- ggplot(subset(loanData, EmploymentStatus %in% c('Full-time', 
                                                'Part-time', 
                                                'Self-employed')), 
       aes(x=StatedMonthlyIncome,
           y=LoanOriginalAmount,
           color=EmploymentStatus)) + 
  xlab('Stated Monthly Income') +
  ylab('Loan Original Amount') +
  geom_jitter(alpha = 1/7) + 
  scale_color_brewer(type = 'qual',
                     #palette = 1,
                     guide = guide_legend(title = 'Employment Status',
                                          override.aes = list(alpha = 1, 
                                                              size = 2)),
                     direction = -1) +
  coord_cartesian(xlim = c(0,30000), 
                  ylim = c(0,max(loanData$LoanOriginalAmount))) + 
  theme_dark()

grid.arrange(p1, p2, ncol = 1)

I’ve gone ahead and plotted out two versions of the map, one with the added employment status of employed and the other without. With the added employment status of employed, with a stated monthly income of approximately $12000 or less, the loan original amount seems to be below $20000. There are added noticable horizontal lines for both graphs, most likely because of rounding.

Without the employed employment status, the picture is a bit clearer. Most part time incomes under $2000 seem to ask for less than $10000 on their loans. Self-Employed incomes on the other hand seem to ask in intervals of approximately $4000, $10000, and $15000.

#Created a function using the substring of listing creation date. 
loanData$ListYear <- substring(loanData$ListingCreationDate,1,4) 

ggplot(subset(loanData, EmploymentStatus != ''), 
       aes(x=EmploymentStatus,
           y=LoanStatusGeneral)) + 
  xlab('Employment Status') +
  ylab('Loan Status General') +
  geom_point(shape=21, alpha=1/15, position='jitter', 
             colour='purple', fill='black', size=1.2, stroke=1) + 
  facet_wrap(~ListYear) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

These plots allow us to see how Prosper has collected data from 2006 to 2014. It seems employment status data wasn’t kept in 2006. In 2007, we don’t see the ‘Employed’ or ‘Other’ status data points. In 2009, there’s very few loans. Then in 2011, we see that the plots begin to increase desnity again.

ggplot(loanData,
       aes(x= CreditScoreAverage,
           y=BorrowerRate)) + 
  xlab('Credit Score Average') +
  ylab('Borrower Rate') +
  geom_jitter(alpha=1/8, aes(color = LoanOriginalAmount)) + 
  scale_colour_gradientn(colours=rainbow(4)) +
  geom_smooth() + 
  coord_cartesian(xlim=c(500,850)) + 
  facet_wrap(~ListYear)

Over time, we see an increase in the minimum credit score average. It seems that most loans in 2006 were under $10000. This seems to continue until around 2010, when you begin see higher loan amounts around $20000. It isn’t until 2013, we start seeing loan amounts reaching $30,000.

loanData$IncomeRange <- ordered(loanData$IncomeRange, 
                                levels=c("Not displayed", 
                                         "Not employed", 
                                         "$0", "$1-24,999",  
                                         "$25,000-49,999", 
                                         "$50,000-74,999", 
                                         "$75,000-99,999", 
                                         "$100,000+"))

ggplot(subset(loanData, IncomeRange != "Not displayed"),
       aes(x=ListYear,
           y=LP_InterestandFees/1000000,
           fill = IncomeRange)) + 
  xlab('Listing Year') +
  ylab('LP Interest and Fees/1,000,000') + 
  scale_fill_brewer('div',
                    palette = 2,
                    guide = guide_legend(title = 'Income Range')) +
  stat_summary(fun.y = sum,
               geom = "bar",
               position = 'stack')

After 2010, the number of higher income individuals that Prosper has made money from increased.

library(memisc)

m1 <- lm(BorrowerRate ~ ProsperScore, data = loanData)
m2 <- update(m1, ~ . + CreditScoreAverage)
m3 <- update(m2, ~ . + ListYear)

mtable(m1, m2, m3, sdigits = 3)
## 
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperScore, data = loanData)
## m2: lm(formula = BorrowerRate ~ ProsperScore + CreditScoreAverage, 
##     data = loanData)
## m3: lm(formula = BorrowerRate ~ ProsperScore + CreditScoreAverage + 
##     ListYear, data = loanData)
## 
## =======================================================================
##                              m1              m2              m3        
## -----------------------------------------------------------------------
##   (Intercept)                0.317***        0.645***        0.692***  
##                             (0.001)         (0.003)         (0.002)    
##   ProsperScore              -0.020***       -0.017***       -0.019***  
##                             (0.000)         (0.000)         (0.000)    
##   CreditScoreAverage                        -0.000***       -0.000***  
##                                             (0.000)         (0.000)    
##   ListYear: 2010/2009                                        0.010***  
##                                                             (0.001)    
##   ListYear: 2011/2009                                       -0.000     
##                                                             (0.001)    
##   ListYear: 2012/2009                                       -0.006***  
##                                                             (0.001)    
##   ListYear: 2013/2009                                       -0.058***  
##                                                             (0.001)    
##   ListYear: 2014/2009                                       -0.080***  
##                                                             (0.001)    
## -----------------------------------------------------------------------
##   R-squared                  0.422           0.506           0.678     
##   adj. R-squared             0.422           0.506           0.678     
##   sigma                      0.057           0.052           0.042     
##   F                      61977.291       43391.684       25570.970     
##   p                          0.000           0.000           0.000     
##   Log-likelihood        123062.508      129685.214      147929.100     
##   Deviance                 273.046         233.579         151.934     
##   AIC                  -246119.017     -259362.427     -295840.199     
##   BIC                  -246090.971     -259325.033     -295756.063     
##   N                      84841           84841           84841         
## =======================================================================

The model above helps to explain the variance in borrower rate. It seems that the variables: Prosper score, credit score average, and listing year are the variables, which seem to make up the model. Other variables were tested, but didn’t seem to have a significant effect.

Multivariate Analysis


New Discoveries in the Data

We saw a positive relationship between loan amount and year of the loan. Other than this, we didn’t discover any new relationships, but did learn about the nuances of previously discovered relationships. For example:

  • Many ‘Completed’ loans where LoanOriginalAmount is lower.
  • Defaulted and ChargedOff loans seem to occur when the StatedMonthlyIncome is low and the LoanOriginalAmount is high.
  • Most part time incomes under $2000 seem to ask for less than $10000 on their loans.
  • Self-employed people seemed to mainly ask for $4000, $10,000 or $15,000.
  • Employment status data wasn’t kept in 2006. In 2007, we don’t see the ‘Employed’ or ‘Other’ status data points.
  • In 2009, there’s very few loans. Then in 2011, we see that the plots begin to increase desnity again.
  • We see very few loans in 2009, after the 2008 financial crisis. 2010 picks back up at a bit. In 2011 the graph is dense again, with many self-employed loans.
  • Over time, we see an increase of the credit score average. It seems that most loans in 2006 were under $10000. This seems to continue until around 2010, when you begin see higher loan amounts around $20000. It isn’t until 2013, we start seeing loan amounts reaching $30,000.
  • After 2010 Prosper made more money from higher income individuals.

Linear Models

A model was made to explain the borrower rate variance. Prosper Score, Credit Score and Listing Year were used and explained 68% of the variance. Other variables were used, but they either had no effect or a negative effect.

Final Plots & Summary


Plot One

cor_matric <- cor(loanData[ , c('StatedMonthlyIncome', 
                  'CreditScoreAverage', 
                  'ProsperScore',
                  'BorrowerRate',
                  'EstimatedReturn',
                  'OpenCreditLines',
                  'TotalCreditLinespast7years',
                  'InquiriesLast6Months',
                  'CurrentDelinquencies',
                  'LoanOriginalAmount',
                  'DebtToIncomeRatio'
                  )], 
    use="complete.obs")

ggcorrplot(cor_matric, 
           hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           title="Correlogram of quantitative loanData variables", 
           legend.title = "Correlation",
           ggtheme=theme_bw)

This plot gives a bird’s eye view of the correlation between multiple quantitative variables. Here, we see that these varibles are strongly correlated:

  • BorrowerRate and EstimatedReturn (0.83)
  • ProsperScore and BorrowerRate (-0.66)
  • OpenCreditLines and TotalCreditLinespast7years (0.57)
  • CreditScoreAverage and BorrowerRate (-0.53)

Plot Two

p1 <- ggplot(subset(loanData, EmploymentStatus %in% c('Employed',
                                                'Full-time', 
                                                'Part-time', 
                                                'Self-employed')), 
       aes(x=StatedMonthlyIncome,
           y=LoanOriginalAmount,
           color=EmploymentStatus)) + 
  xlab('Stated Monthly Income') +
  ylab('Loan Original Amount') +
  geom_jitter(alpha = 1/10) + 
  scale_color_brewer(type = 'qual',
                     #palette = 1,
                     guide = guide_legend(title = 'Employment Status',
                                          override.aes = list(alpha = 1, 
                                                              size = 2)),
                     direction = -1) +
  #geom_smooth() +
  coord_cartesian(xlim = c(0,30000), 
                  ylim = c(0,max(loanData$LoanOriginalAmount))) + 
  theme_dark()

p2 <- ggplot(subset(loanData, EmploymentStatus %in% c('Full-time', 
                                                'Part-time', 
                                                'Self-employed')), 
       aes(x=StatedMonthlyIncome,
           y=LoanOriginalAmount,
           color=EmploymentStatus)) + 
  xlab('Stated Monthly Income') +
  ylab('Loan Original Amount') +
  geom_jitter(alpha = 1/7) + 
  scale_color_brewer(type = 'qual',
                     #palette = 1,
                     guide = guide_legend(title = 'Employment Status',
                                          override.aes = list(alpha = 1, 
                                                              size = 2)),
                     direction = -1) +
  coord_cartesian(xlim = c(0,30000), 
                  ylim = c(0,max(loanData$LoanOriginalAmount))) + 
  theme_dark()

grid.arrange(p1, p2, ncol = 1)

I’ve gone ahead and plotted out two versions of the map, one with the added employment status of employed and the other without. With the added employment status of employed, with a stated monthly income of approximately $12000 or less, the loan original amount seems to be below $20000. There are added noticable horizontal lines for both graphs, most likely because of rounding.

Without the employed employment status, the picture is a bit clearer. Most part time incomes under $2000 seem to ask for less than $10000 on their loans. Self-Employed incomes on the other hand seem to ask in intervals of approximately $4000, $10000, and $15000.

Plot Three

ggplot(subset(loanData, ListYear > 2005),
       aes(x= CreditScoreAverage,
           y=BorrowerRate/10)) + 
  geom_jitter(alpha=1/8, aes(color = LoanOriginalAmount)) + 
  scale_colour_gradientn(colours=rainbow(4),
                         name = 'Loan Amount ($)') +
  geom_smooth() + 
  coord_cartesian(xlim=c(500,850)) + 
  facet_wrap(~ListYear) + 
  ggtitle("Borrower Rate by Credit Score, 
          showing loan amount, faceted by year") + 
  labs(x="Average Credit Score", y= "Borrower Interest Rate") + 
  scale_y_continuous(labels = scales::percent) + 
  guides(fill=guide_legend(title="Loan Amount"))
## `geom_smooth()` using method = 'gam'

Over time, we see an increase in the minimum credit score average. It seems that most loans in 2006 were under $10000. This seems to continue until around 2010, when you begin see higher loan amounts around $20000. It isn’t until 2013, we start seeing loan amounts reaching $30,000.

Reflection


I began this exploration with the Prosper Data set to better determine what variables seem to have the most leverage on lowering interest rate, while achieving a higher loan amount. Since the dataset didn’t have a loan amount approved variable, the efforts were focused towards borrower rate instead. It seems that credit score plays the biggest factor in achieving a lower interest rate. Interest rates seem to rise and fall through the years, and is hard to predict through this dataset alone.

The univariate analysis allowed me to determine what variables would be most suitable for my analysis. Afterwards, the bivariate analysis allows me to see correlations between the data variable pairs, allowing to better determine relationships (if any). Some interesting relationships to note: Higher prosper scores tend to lead to lower borrower rate. Reasonably, high credit scores tend to lead to lower borrower rate’s as well.

It’s hard determine what data variables really lead towards the right direction, and part of it determines on how we as the analyst tend to interpret the data and the variables as well. Often, I went back and forth between what variables make sense to illustrate versus which ones aren’t needed.

If there was more time spent on the dataset, I’d like to see if country location increases or decreases borrower rate and if number of dependents increases or decreases the percentage of completing loans. I’d also like look more deeply into credit lines, inquiries, and debt-to-income ratios overall.